Spatial Partitioning Techniques in SpatialHadoop

Authors

  • Ahmed Eldawy
  • Louai Alarabi
  • Mohamed F. Mokbel
Abstract

SpatialHadoop is an extended MapReduce framework that supports global indexing by spatially partitioning the data across machines, providing orders of magnitude speedup compared to traditional Hadoop. In this paper, we describe seven alternative partitioning techniques and experimentally study their effect on the quality of the generated index and the performance of range and spatial join queries. We found that using a 1% sample is enough to produce high quality partitions. Also, we found that the total area of partitions is a reasonable measure of the quality of indexes when running spatial join. This study will assist researchers in choosing a good spatial partitioning technique in distributed environments.

1. INDEXING IN SPATIALHADOOP

SpatialHadoop [2, 3] provides a generic indexing algorithm which was used to implement grid, R-tree, and R+-tree based partitioning. This paper extends our previous study by introducing four new partitioning techniques, Z-curve, Hilbert curve, Quad tree, and K-d tree, and experimentally evaluates all seven techniques.

The partitioning phase of the indexing algorithm runs in three steps, where the first step is fixed and the last two steps are customized for each partitioning technique. The first step computes the number of desired partitions n based on the file size and the HDFS block capacity, which are both fixed for all partitioning techniques. The second step reads a random sample, with a sampling ratio ρ, from the input file and uses this sample to partition the space into n cells such that the number of sample points in each cell is at most ⌊k/n⌋, where k is the sample size. The third step actually partitions the file by assigning each record to one or more cells. Boundary objects are handled using either the distribution or the replication method. The distribution method assigns an object to exactly one overlapping cell, and the cell has to be expanded to enclose all contained records. The replication method avoids expanding cells by replicating each record to all overlapping cells, but the query processor then has to employ a duplicate avoidance technique to account for replicated records.

2. EXPERIMENTAL SETUP

All experiments run on Amazon EC2 'm1.large' instances, each with a dual-core processor, 7.5 GB of RAM, and 840 GB of disk storage. We use Hadoop 1.2.1 running on Java 1.6 and CentOS 6. Each machine is configured to run three mappers and two reducers. Tables 1 and 2 summarize the datasets and configuration parameters used in our experiments, respectively. Default parameters (in parentheses) are used unless otherwise mentioned. In the following, we describe the partitioning techniques, the queries we run, and the performance metrics measured in this paper.
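Before turning to the individual techniques, the following minimal sketch makes the three-step partitioning phase of Section 1 concrete. It is our illustration, not SpatialHadoop's actual API; all class and method names are assumptions, and the technique-specific parts (steps 2 and 3) are left as comments since Section 2.1 covers them per technique.

```java
// Minimal sketch of the generic partitioning phase described in Section 1.
// Class and method names are illustrative assumptions, not SpatialHadoop's API.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PartitionPhaseSketch {

    // Step 1 (fixed): the number of desired partitions n, computed from the
    // input file size and the HDFS block capacity.
    static int numPartitions(long fileSizeBytes, long blockCapacityBytes) {
        return (int) Math.ceil((double) fileSizeBytes / blockCapacityBytes);
    }

    // Step 2: draw a random sample with sampling ratio rho from the input
    // records (the paper finds rho = 1% is enough for high quality partitions).
    static <T> List<T> drawSample(Iterable<T> records, double rho, Random rnd) {
        List<T> sample = new ArrayList<>();
        for (T record : records) {
            if (rnd.nextDouble() < rho) {
                sample.add(record);
            }
        }
        return sample;
    }

    // The rest of step 2 and all of step 3 are technique-specific: the sample
    // is used to compute n cell boundaries holding at most floor(k/n) sample
    // points each (k = sample size), and every input record is then assigned
    // to one or more overlapping cells via the distribution or replication
    // method (Section 2.1).
}
```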
2.1 Partitioning Techniques

This paper employs grid and Quad tree as space partitioning techniques; STR, STR+, and K-d tree as data partitioning techniques; and Z-curve and Hilbert curve as space-filling curve (SFC) partitioning techniques. These techniques can also be grouped, according to how they handle boundary objects, into replication-based techniques (i.e., Grid, Quad tree, STR+, and K-d tree) and distribution-based techniques (i.e., STR, Z-curve, and Hilbert). Figure 1 illustrates these techniques, where sample points and partition boundaries are shown as dots and rectangles, respectively.

1. Uniform Grid: This technique does not require a random sample, as it divides the input MBR using a uniform grid of ⌈√n⌉ × ⌈√n⌉ cells and employs the replication method to handle boundary objects.

2. Quad tree: This technique inserts all sample points into a quad tree [6] with node capacity ⌊k/n⌋, where k is the sample size. The boundaries of all leaf nodes are used as cell boundaries. We use the replication method to assign records to cells.

3. STR: This technique bulk loads the random sample into an R-tree using the STR algorithm [8], with the capacity of each node set to ⌊k/n⌋. The MBRs of the leaf nodes are used as cell boundaries. Boundary objects are handled using the distribution method, which assigns a record r to the cell with maximal overlap.

4. STR+: This technique is similar to the STR technique, but it uses the replication method to handle boundary objects.

5. K-d tree: This technique uses the K-d tree [1] partitioning method to partition the space into n cells. It starts with the input MBR as one cell and partitions it n − 1 times to produce n cells. Records are assigned to cells using the replication method.

6. Z-curve: This technique sorts the sample points by their order on the Z-curve and partitions the curve into n splits, each containing roughly ⌊k/n⌋ points. It uses the distribution method to assign a record r to one cell by mapping the center point of its MBR to one of the n splits (see the sketch after this list).

7. Hilbert curve: This technique is exactly the same as the Z-curve technique, but it uses the Hilbert space-filling curve, which has better spatial properties.
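As a concrete illustration of the SFC-based techniques, the sketch below shows one plausible implementation of Z-curve partitioning. The grid resolution, bit-interleaving, and split-selection details are our assumptions, not taken from the SpatialHadoop source.

```java
// Hedged sketch of Z-curve partitioning (technique 6). Bit-interleaving and
// split computation are illustrative assumptions, not SpatialHadoop code.
import java.util.Arrays;

public class ZCurveSketch {

    // Map a point on a 2^16 x 2^16 grid to its Z-order value by interleaving
    // the bits of x (even positions) and y (odd positions).
    static long zOrder(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= (long) ((x >> i) & 1) << (2 * i);
            z |= (long) ((y >> i) & 1) << (2 * i + 1);
        }
        return z;
    }

    // Sort the k sample points by Z-order and cut the curve into n splits,
    // each containing roughly floor(k/n) points.
    static long[] computeSplits(long[] sampleZ, int n) {
        Arrays.sort(sampleZ);
        long[] splits = new long[n - 1];
        int k = sampleZ.length;
        for (int i = 1; i < n; i++) {
            splits[i - 1] = sampleZ[(int) ((long) i * k / n)];
        }
        return splits;
    }

    // Distribution method: assign a record to exactly one cell using the
    // Z-order value of the center point of its MBR.
    static int assignCell(long centerZ, long[] splits) {
        int pos = Arrays.binarySearch(splits, centerZ);
        return pos >= 0 ? pos : -(pos + 1);
    }
}
```

The Hilbert-curve technique (technique 7) would differ only in replacing zOrder with a Hilbert-index computation; the sorting, splitting, and assignment steps stay the same.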


Journal: PVLDB
Volume 8, No. 12
Publication year: 2015